Summary

In this project, we center our exploratory data analysis on one segment of the population: households with children. We investigate topics such as the factors that influence the presence of children in families and the states with an unusually high or low number of children. Working from American Community Survey data on individual households, we have chosen numeric and categorical variables such as state of residence, family type and employment status, languages spoken in the household, and household income. We have also created new derived variables that extract and combine valuable information.

Initial data overview

## 'data.frame':    308611 obs. of  20 variables:
##  $ NOC     : int  2 1 2 2 1 2 2 1 2 1 ...
##  $ NP      : int  4 3 4 4 3 4 4 3 4 4 ...
##  $ BDSP    : int  1 3 3 3 2 3 4 2 4 3 ...
##  $ RMSP    : int  3 5 8 5 5 4 11 4 8 11 ...
##  $ RNTP    : int  920 750 400 NA NA NA NA 590 NA 2900 ...
##  $ VALP    : int  NA NA NA 100000 94000 20000 240000 NA 2000 NA ...
##  $ FES     : int  2 5 2 1 5 2 2 1 8 1 ...
##  $ FINCP   : int  97000 25000 45060 98000 50000 10000 116000 41200 15500 135600 ...
##  $ GRPIP   : int  12 50 12 NA NA NA NA 23 NA 29 ...
##  $ HHL     : int  4 1 1 1 1 1 1 5 1 1 ...
##  $ HINCP   : int  97000 25000 45060 98000 77000 10000 116000 41200 53500 135600 ...
##  $ TAXP    : int  NA NA NA 8 10 1 17 NA 1 NA ...
##  $ WKEXREL : int  3 11 3 1 10 6 2 1 15 4 ...
##  $ WORKSTAT: int  3 10 3 1 10 6 3 1 15 1 ...
##  $ WATP    : int  250 480 2 30 300 420 1 180 50 30 ...
##  $ GASP    : int  3 3 3 60 30 3 3 3 170 20 ...
##  $ FULP    : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ ELEP    : int  40 250 50 140 140 120 190 170 160 300 ...
##  $ name    : Factor w/ 52 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ abbr    : Factor w/ 52 levels "AK         ",..: 2 2 2 2 2 2 2 2 2 2 ...

Our dataset consists of 20 variables and a total of 308,611 households.

Children across U.S. states

In the first part, we will take a look at the average number of children in households across the country.

Map of the average number of children per household (dark blue indicates a high average and white a low one)

As the map above shows, discrepancies between states are obvious, and there are clear clusters of states where the average number of own children is particularly high or low. For instance, the New England states form a cluster of their own in which the number of children is low. Looking at the Midwest, there is a darker cluster around WY where NOC is as high as 1.9 to 2.4.

Top 5 states with highest and lowest NOC

State or district    Avg. number of children    Rank
Utah                 2.4                        1
Idaho                2.2                        2
Alaska               2.1                        3
North Dakota         2.0                        4
South Dakota         2.0                        5
...                  ...                        ...
Rhode Island         1.75                       47
Vermont              1.76                       48
West Virginia        1.76                       49
New Hampshire        1.72                       50
D.C.                 1.66                       51
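A ranking like the one above can be sketched in base R as follows. This is a minimal illustration rather than the code used for the report: the `toy` data frame and its values are invented, and only the column names `name` and `NOC` come from the data overview.

```r
# Toy stand-in for the household data; the values are invented for illustration.
toy <- data.frame(
  name = c("Utah", "Utah", "Idaho", "Idaho", "Vermont", "Vermont"),
  NOC  = c(3, 2, 2, 2, 1, 2)
)

# Mean number of children per state
state_avg <- aggregate(NOC ~ name, data = toy, FUN = mean)

# Sort from highest to lowest average and attach a rank
state_avg <- state_avg[order(-state_avg$NOC), ]
state_avg$rank <- seq_len(nrow(state_avg))

head(state_avg, 2)  # states with the highest average
tail(state_avg, 1)  # state with the lowest average
```

On the full data, `head(state_avg, 5)` and `tail(state_avg, 5)` would give the two halves of the table.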

Univariate Analysis Section

Family Type and Employment Status

The bubble chart above shows the mean NOC for each family type and employment status. It shows that married-couple families with only the husband in the labor force have the largest mean number of children.
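The group means behind such a chart could be computed along these lines. This is only a sketch: the `toy` values are invented, and the numeric `FES` codes are left unlabeled because their meanings should be taken from the survey's data dictionary rather than from this example.

```r
# Toy data: FES is the family type and employment status code, NOC the number of children.
# The values are invented for illustration.
toy <- data.frame(
  FES = c(1, 1, 2, 2, 2, 7),
  NOC = c(2, 1, 3, 2, 4, 1)
)

# Mean number of children within each FES category
mean_noc <- tapply(toy$NOC, toy$FES, mean)
mean_noc

# Share of households in each FES category, in percent
share <- 100 * prop.table(table(toy$FES))
share
```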

In the following part, I use ggplot to show the percentage of households and the total NOC for each family type and employment status. These plots show that families of the type “Only Husband in Labor Force” tend to have more children, while families of the type “Female in Labor Force” tend to have fewer.

Building on the first part, I compare family type and employment status between the states with more children and the states with fewer children.

Comparing these two graphs, we find that in states with a higher number of children, “Both in Labor Force” and “Only Husband in Labor Force” account for the largest percentages, whereas in states with a lower number of children, “Both in Labor Force” and “Female in Labor Force” account for the largest percentages. Hence, family type and employment status appear to be associated with the number of children. The comparison across states supports my finding that “Only Husband in Labor Force” families tend to have more children and “Female in Labor Force” families tend to have fewer. This is consistent with our assumption that raising children requires money and that women take on more of the responsibility for raising them.

Income per person

An artificial variable, income per person, was created by dividing household income by the number of persons in the household. We want to explore whether income per person is a factor that guides the decision to have children. We first investigate the distribution of income per person.

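library(ggplot2)

# Income per person (IPP): household income divided by household size
md <- df
md$IPP <- md$HINCP / md$NP

# Collapse the number of children into three groups: 1, 2, and 3 or more
# (assumes every household in df has at least one child)
md$NOC <- cut(md$NOC, breaks = c(0, 1, 2, Inf), labels = c('1', '2', '3+'))

# Histogram of income per person in $5,000 bins, bars shaded by count
ggplot(data = md, aes(x = IPP)) +
  geom_histogram(breaks = seq(0, 120000, by = 5000),
                 col = "red",
                 aes(fill = after_stat(count))) +
  scale_fill_gradient("Count", low = "green", high = "red") +
  xlab('Income per person')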

Income per person is heavily right-skewed, as we expected. A very large percentage of our population lives on less than $50,000 per person a year.

This graph of three variables (number of children, income per person, and rent as a percentage of household income) shows two rather distinct clusters. The red points, households with one child, appear to lie on the ‘surface’. In other words, even among households that spend the same 40, 50, or even 60% of their income on rent, the more each person earns, the fewer children the household has. This is a very interesting phenomenon.


Histogram of income per person grouped by number of children. The three groups share similar lower bounds, while the upper bounds appear to be larger for families with more children. This does not contradict the previous graph, because families with one child outnumber the other groups by a wide margin.

Household income

To understand the relationship between the number of children in a household and household income, we first plotted a histogram of household income. Apart from some extremely wealthy families, most households have an annual income between $0 and $25,000.

We then generated box plots for the different NOC levels. The wealthy households show up as outliers in the box plots. Interestingly, the extremely wealthy families tend to have only one or two children rather than as many as they could afford. In addition, the typical level of household income appears to decrease as the number of children increases. The pattern is much clearer once the extreme values are removed.

Furthermore, we calculated the average number of children at each level of household income to check our hypothesis. The following graph shows a LOWESS smoothing line fitted to the resulting table. There is a clear decreasing trend when household income exceeds $50,000, while the trend is less clear below that level.
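A LOWESS line of this kind can be fitted with the base R function `lowess()`. The sketch below uses simulated averages, so the numbers and the smoother span `f` are assumptions for illustration and do not reproduce the report's figure.

```r
# Simulated table of average number of children by household income (illustrative only)
set.seed(1)
income  <- seq(10000, 200000, by = 10000)
avg_noc <- 2.2 - income / 400000 + rnorm(length(income), sd = 0.05)

# LOWESS smoother; f is the fraction of points used for each local fit
fit <- lowess(income, avg_noc, f = 2/3)

# Scatter plot of the averages with the smoothed line on top
plot(income, avg_noc,
     xlab = "Household income", ylab = "Average number of children")
lines(fit, col = "red")
```

A smaller `f` follows the data more closely, while a larger one gives a smoother line.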

To test our hypothesis, we used household income from the ten states selected previously: five with large NOC and five with small NOC. For simplicity and readability of the graph, we created class intervals with a bin width of $1,000, assigned each data point to its bin, and calculated the average number of children within each bin. Two colors, grey and red, distinguish the two groups of states, and the point size reflects the average number of children. For a fixed number of children, household income in the states with fewer children seems to be slightly greater than in the other five states.
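The binning step described above can be sketched as follows. The `toy` values are invented, and labeling each bin by its lower edge is one possible convention, not necessarily the one used for the graph.

```r
# Toy household incomes and numbers of children (invented for illustration)
toy <- data.frame(
  HINCP = c(500, 900, 1200, 1800, 2500, 2999),
  NOC   = c(1, 3, 2, 1, 1, 3)
)

# Assign each household to a $1,000-wide bin, labeled by the bin's lower edge
bin_width <- 1000
toy$bin <- floor(toy$HINCP / bin_width) * bin_width

# Average number of children within each bin
binned <- aggregate(NOC ~ bin, data = toy, FUN = mean)
binned
```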


In addition to the graphical analysis, we applied statistical tests to support our result. First, we conducted an F test to compare the variances of household income in the two groups of states. The small p-value indicates that the variances are unequal.

## 
##  F test to compare two variances
## 
## data:  hdata_st1$HINCP and hdata_st2$HINCP
## F = 0.67797, num df = 7276, denom df = 3662, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6407689 0.7169477
## sample estimates:
## ratio of variances 
##           0.677968
## 
##  Welch Two Sample t-test
## 
## data:  hdata_st1$HINCP and hdata_st2$HINCP
## t = -2.0828, df = 6223.1, p-value = 0.03731
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7021.3734  -212.6871
## sample estimates:
## mean of x mean of y 
##  85151.78  88768.81

Because the variances differ, we used Welch’s two-sample t test to compare the means. The null hypothesis is that the two groups of states have the same average household income. We reject the null hypothesis at the 0.05 significance level (p = 0.037). Combined with our plots, we may infer that household income and number of children are negatively associated, even though the relationship is not strong.
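The two tests correspond to the base R functions `var.test()` and `t.test()`. The sketch below runs them on simulated incomes, so the vectors `high_noc` and `low_noc` and their parameters are assumptions for illustration and will not reproduce the output shown above.

```r
# Simulated household incomes for two groups of states (illustrative only)
set.seed(42)
high_noc <- rnorm(200, mean = 85000, sd = 40000)
low_noc  <- rnorm(150, mean = 89000, sd = 50000)

# F test for equality of the two variances
f_res <- var.test(high_noc, low_noc)

# Welch's t test, which does not assume equal variances
t_res <- t.test(high_noc, low_noc, var.equal = FALSE)

f_res$p.value
t_res$p.value
```

`var.equal = FALSE` is the default in `t.test()`; it is spelled out here to make the choice of the Welch version explicit.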

Living costs

We investigated the relationship between the number of children (NOC) and the house cost ratio, calculated from six variables: gross rent as a percentage of household income (GRPIP), monthly rent (RNTP), monthly gas cost (GASP), monthly electricity cost (ELEP), yearly fuel cost (FULP), and yearly water cost (WATP).

(1) The house cost ratio (HCR) and its levels

First, we selected the columns and cleaned the data.

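Then we calculated the house-related cost ratio (HCR) based on the equation:

$$\mathrm{HCR} = \mathrm{GRPIP} \times \frac{12\,(\mathrm{RNTP} + \mathrm{GASP} + \mathrm{ELEP}) + \mathrm{WATP} + \mathrm{FULP}}{12 \times \mathrm{RNTP}}$$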
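Read with its multiplication signs written out, the equation scales the rent share of income (GRPIP) by the ratio of all yearly housing costs to yearly rent. A minimal sketch of that reading follows; the function name `hcr` is hypothetical, and the example values are those of the first household in the data overview.

```r
# House-related cost ratio, assuming the multiplications are as written out here:
# GRPIP * (12 * (RNTP + GASP + ELEP) + WATP + FULP) / (12 * RNTP)
hcr <- function(GRPIP, RNTP, GASP, ELEP, WATP, FULP) {
  yearly_cost <- 12 * (RNTP + GASP + ELEP) + WATP + FULP  # monthly costs annualized
  yearly_rent <- 12 * RNTP
  GRPIP * yearly_cost / yearly_rent
}

# First household in the overview: rent 920, gas 3, electricity 40 per month,
# water 250 and fuel 2 per year, rent at 12% of household income
first_row <- hcr(GRPIP = 12, RNTP = 920, GASP = 3, ELEP = 40, WATP = 250, FULP = 2)
first_row
```

For this household the ratio comes out slightly above its GRPIP of 12, since utilities add only a small amount on top of rent. Households with a missing rent (`NA` in RNTP) yield `NA`.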

After computing the HCR, we plotted it to check the overall distribution. The graph shows that about 70% of the house-related cost ratios fall between 12.5% and 50%.

Then we plotted the box plot:

What can be inferred from this graph is that the ratio increases slightly with the number of children, but there is no clear relationship between the two variables. Thus, we decided to divide the house cost ratio into five levels: low, medium, intermediate high, high, and extremely high.
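Dividing a numeric ratio into labeled levels can be done with `cut()`. The cut points in the sketch below are hypothetical, since the thresholds actually used are not stated; only the five level names come from the text.

```r
# Example cost ratios in percent (invented for illustration)
hcr_values <- c(8, 20, 35, 55, 120)

# Hypothetical cut points; the thresholds used in the analysis are not stated
breaks <- c(0, 15, 30, 50, 100, Inf)
labels <- c("Low", "Medium", "Intermediate high", "High", "Extremely high")

# Each value falls into the interval (lower, upper] that contains it
hcr_level <- cut(hcr_values, breaks = breaks, labels = labels)
table(hcr_level)
```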

Then we plotted the relationship between NOC and the cost ratio within each of the five levels. We found that at the medium and extremely high levels, the cost ratio has a larger apparent effect on the number of children.

(2) HCR in different states

Having obtained the average number of children in each state, we estimated the average house cost ratio in each state and plotted it on the map.

After that, we picked the first five states and the last five states:

We then plotted the first five and the last five states and found a relatively stronger relationship between house-related cost and the number of children in households in these graphs.

Reflection

The team conducted an exploratory data analysis of children in households, hoping to find common features shared by families with more children and to identify what causes the discrepancies between states. We looked in particular at four aspects of households: employment status, income per person, household income, and living costs. Across all of these subtopics we made interesting findings, both expected and unexpected. In summary, differences in the number of children between states can be attributed mainly to the employment status of household members and to various living costs. A very surprising finding about income is that, in general, the more a family earns, the fewer children it has. Future work could include finding third-party data on states’ basic statistics and laws on abortion, as well as incorporating the person-level datasets into the study.